
E-book | Spark Stream Processing | Mastering Structured Streaming and Spark Streaming



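The book is in English, but its content is rich and detailed and well worth reading. The complete table of contents follows; a note at the end of the post explains how to get the book.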

Part I. Fundamentals of Stream Processing with Apache Spark

1. Introducing Stream Processing

What Is Stream Processing?

Batch Versus Stream Processing

The Notion of Time in Stream Processing

The Factor of Uncertainty

Some Examples of Stream Processing

Scaling Up Data Processing

MapReduce

The Lesson Learned: Scalability and Fault Tolerance

Distributed Stream Processing

Stateful Stream Processing in a Distributed System

Introducing Apache Spark

The First Wave: Functional APIs

The Second Wave: SQL

A Unified Engine

Spark Components

Spark Streaming

Structured Streaming

Where Next?

2. Stream-Processing Model

Sources and Sinks

Immutable Streams Defined from One Another

Transformations and Aggregations

Window Aggregations

Tumbling Windows

Sliding Windows

Stateless and Stateful Processing

Stateful Streams

An Example: Local Stateful Computation in Scala

A Stateless Definition of the Fibonacci Sequence as a Stream

Transformation

Stateless or Stateful Streaming

The Effect of Time

Computing on Timestamped Events

Timestamps as the Provider of the Notion of Time

Event Time Versus Processing Time

Computing with a Watermark

Summary

3. Streaming Architectures

Components of a Data Platform

Architectural Models

The Use of a Batch-Processing Component in a Streaming Application

Referential Streaming Architectures

The Lambda Architecture

The Kappa Architecture

Streaming Versus Batch Algorithms

Streaming Algorithms Are Sometimes Completely Different in Nature

Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms

Summary

4. Apache Spark as a Stream-Processing Engine

The Tale of Two APIs

Spark’s Memory Usage

Failure Recovery

Lazy Evaluation

Cache Hints

Understanding Latency

Throughput-Oriented Processing

Spark’s Polyglot API

Fast Implementation of Data Analysis

To Learn More About Spark

Summary

5. Spark’s Distributed Processing Model

Running Apache Spark with a Cluster Manager

Examples of Cluster Managers

Spark’s Own Cluster Manager

Understanding Resilience and Fault Tolerance in a Distributed System

Fault Recovery

Cluster Manager Support for Fault Tolerance

Data Delivery Semantics

Microbatching and One-Element-at-a-Time

Microbatching: An Application of Bulk-Synchronous Processing

One-Record-at-a-Time Processing

Microbatching Versus One-at-a-Time: The Trade-Offs

Bringing Microbatch and One-Record-at-a-Time Closer Together

Dynamic Batch Interval

Structured Streaming Processing Model

The Disappearance of the Batch Interval

6. Spark’s Resilience Model

Resilient Distributed Datasets in Spark

Spark Components

Spark’s Fault-Tolerance Guarantees

Task Failure Recovery

Stage Failure Recovery

Driver Failure Recovery

Summary


Part II. Structured Streaming

7. Introducing Structured Streaming

First Steps with Structured Streaming

Batch Analytics

Streaming Analytics

Connecting to a Stream

Preparing the Data in the Stream

Operations on Streaming Dataset

Creating a Query

Start the Stream Processing

Exploring the Data

Summary

8. The Structured Streaming Programming Model

Initializing Spark

Sources: Acquiring Streaming Data

Available Sources

Transforming Streaming Data

Streaming API Restrictions on the DataFrame API

Sinks: Output the Resulting Data

format

outputMode

queryName

option

options

trigger

start()

Summary

9. Structured Streaming in Action

Consuming a Streaming Source

Application Logic

Writing to a Streaming Sink

Summary

10. Structured Streaming Sources

Understanding Sources

Reliable Sources Must Be Replayable

Sources Must Provide a Schema

Available Sources

The File Source

Specifying a File Format

Common Options

Common Text Parsing Options (CSV, JSON)

JSON File Source Format

CSV File Source Format

Parquet File Source Format

Text File Source Format

The Kafka Source

Setting Up a Kafka Source

Selecting a Topic Subscription Method

Configuring Kafka Source Options

Kafka Consumer Options

The Socket Source

Configuration

Operations

The Rate Source

Options

11. Structured Streaming Sinks

Understanding Sinks

Available Sinks

Reliable Sinks

Sinks for Experimentation

The Sink API

Exploring Sinks in Detail

The File Sink

Using Triggers with the File Sink

Common Configuration Options Across All Supported File Formats

Common Time and Date Formatting (CSV, JSON)

The CSV Format of the File Sink

The JSON File Sink Format

The Parquet File Sink Format

The Text File Sink Format

The Kafka Sink

Understanding the Kafka Publish Model

Using the Kafka Sink

The Memory Sink

Output Modes

The Console Sink

Options

Output Modes

The Foreach Sink

The ForeachWriter Interface

TCP Writer Sink: A Practical ForeachWriter Example

The Moral of This Example

Troubleshooting ForeachWriter Serialization Issues

12. Event Time–Based Stream Processing

Understanding Event Time in Structured Streaming

Using Event Time

Processing Time

Watermarks

Time-Based Window Aggregations

Defining Time-Based Windows

Understanding How Intervals Are Computed

Using Composite Aggregation Keys

Tumbling and Sliding Windows

Record Deduplication

Summary

13. Advanced Stateful Operations

Example: Car Fleet Management

Understanding Group with State Operations

Internal State Flow

Using MapGroupsWithState

Using FlatMapGroupsWithState

Output Modes

Managing State Over Time

Summary

14. Monitoring Structured Streaming Applications

The Spark Metrics Subsystem

Structured Streaming Metrics

The StreamingQuery Instance

Getting Metrics with StreamingQueryProgress

The StreamingQueryListener Interface

Implementing a StreamingQueryListener

15. Experimental Areas: Continuous Processing and Machine Learning

Continuous Processing

Understanding Continuous Processing

Using Continuous Processing

Limitations

Machine Learning

Learning Versus Exploiting

Applying a Machine Learning Model to a Stream

Example: Estimating Room Occupancy by Using Ambient Sensors

Online Training


Part III. Spark Streaming

16. Introducing Spark Streaming

The DStream Abstraction

DStreams as a Programming Model

DStreams as an Execution Model

The Structure of a Spark Streaming Application

Creating the Spark Streaming Context

Defining a DStream

Defining Output Operations

Starting the Spark Streaming Context

Stopping the Streaming Process

Summary

17. The Spark Streaming Programming Model

RDDs as the Underlying Abstraction for DStreams

Understanding DStream Transformations

Element-Centric DStream Transformations

RDD-Centric DStream Transformations

Counting

Structure-Changing Transformations

Summary

18. The Spark Streaming Execution Model

The Bulk-Synchronous Architecture

The Receiver Model

The Receiver API

How Receivers Work

The Receiver’s Data Flow

The Internal Data Resilience

Receiver Parallelism

Balancing Resources: Receivers Versus Processing Cores

Achieving Zero Data Loss with the Write-Ahead Log

The Receiverless or Direct Model

Summary

19. Spark Streaming Sources

Types of Sources

Basic Sources

Receiver-Based Sources

Direct Sources

Commonly Used Sources

The File Source

How It Works

The Queue Source

How It Works

Using a Queue Source for Unit Testing

A Simpler Alternative to the Queue Source: The ConstantInputDStream

The Socket Source

How It Works

The Kafka Source

Using the Kafka Source

How It Works

Where to Find More Sources

20. Spark Streaming Sinks

Output Operations

Built-In Output Operations

print

saveAsxyz

foreachRDD

Using foreachRDD as a Programmable Sink

Third-Party Output Operations

21. Time-Based Stream Processing

Window Aggregations

Tumbling Windows

Window Length Versus Batch Interval

Sliding Windows

Sliding Windows Versus Batch Interval

Sliding Windows Versus Tumbling Windows

Using Windows Versus Longer Batch Intervals

Window Reductions

reduceByWindow

reduceByKeyAndWindow

countByWindow

countByValueAndWindow

Invertible Window Aggregations

Slicing Streams

Summary

22. Arbitrary Stateful Streaming Computation

Statefulness at the Scale of a Stream

updateStateByKey

Limitation of updateStateByKey

Performance

Memory Usage

Introducing Stateful Computation with mapWithState

Using mapWithState

Event-Time Stream Computation Using mapWithState

23. Working with Spark SQL

Spark SQL

Accessing Spark SQL Functions from Spark Streaming

Example: Writing Streaming Data to Parquet

Dealing with Data at Rest

Using Join to Enrich the Input Stream

Join Optimizations

Updating Reference Datasets in a Streaming Application

Enhancing Our Example with a Reference Dataset

Summary

24. Checkpointing

Understanding the Use of Checkpoints

Checkpointing DStreams

Recovery from a Checkpoint

Limitations

The Cost of Checkpointing

Checkpoint Tuning

25. Monitoring Spark Streaming

The Streaming UI

Understanding Job Performance Using the Streaming UI

Input Rate Chart

Scheduling Delay Chart

Processing Time Chart

Total Delay Chart

Batch Details

The Monitoring REST API

Using the Monitoring REST API

Information Exposed by the Monitoring REST API

The Metrics Subsystem

The Internal Event Bus

Interacting with the Event Bus

Summary

26. Performance Tuning

The Performance Balance of Spark Streaming

The Relationship Between Batch Interval and Processing Delay

The Last Moments of a Failing Job

Going Deeper: Scheduling Delay and Processing Delay

Checkpoint Influence in Processing Time

External Factors That Influence the Job’s Performance

How to Improve Performance?

Tweaking the Batch Interval

Limiting the Data Ingress with Fixed-Rate Throttling

Backpressure

Dynamic Throttling

Tuning the Backpressure PID

Custom Rate Estimator

A Note on Alternative Dynamic Handling Strategies

Caching

Speculative Execution


Part IV. Advanced Spark Streaming Techniques

27. Streaming Approximation and Sampling Algorithms

Exactness, Real Time, and Big Data

Exactness

Real-Time Processing

Big Data

The Exactness, Real-Time, and Big Data Triangle

Big Data and Real Time

Approximation Algorithms

Hashing and Sketching: An Introduction

Counting Distinct Elements: HyperLogLog

Role-Playing Exercise: If We Were a System Administrator

Practical HyperLogLog in Spark

Counting Element Frequency: Count-Min Sketches

Introducing Bloom Filters

Bloom Filters with Spark

Computing Frequencies with a Count-Min Sketch

Ranks and Quantiles: T-Digest

T-Digest in Spark

Reducing the Number of Elements: Sampling

Random Sampling

Stratified Sampling

28. Real-Time Machine Learning

Streaming Classification with Naive Bayes

streamDM Introduction

Naive Bayes in Practice

Training a Movie Review Classifier

Introducing Decision Trees

Hoeffding Trees

Hoeffding Trees in Spark, in Practice

Streaming Clustering with Online K-Means

K-Means Clustering

Online Data and K-Means

The Problem of Decaying Clusters

Streaming K-Means with Spark Streaming


Part V. Beyond Apache Spark

29. Other Distributed Real-Time Stream Processing Systems

Apache Storm

Processing Model

The Storm Topology

The Storm Cluster

Compared to Spark

Apache Flink

A Streaming-First Framework

Compared to Spark

Kafka Streams

Kafka Streams Programming Model

Compared to Spark

In the Cloud

Amazon Kinesis on AWS

Microsoft Azure Stream Analytics

Apache Beam/Google Cloud Dataflow

30. Looking Ahead

Stay Plugged In

Seek Help on Stack Overflow

Start Discussions on the Mailing Lists

Attend Conferences

Attend Meetups

Read Books

Contributing to the Apache Spark Project


Five PDFs are available: reply "ss" in the official account's message box to get them.


Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有